feat: Add judge evaluation support to agent graphs #142
Merged
jsonbailey merged 10 commits into main on Apr 28, 2026
Conversation
Adds per-node judge evaluation to agent graph execution. Each AIAgentConfig now carries a pre-built Evaluator (mirroring AICompletionConfig) that the provider-specific AgentGraphRunner invokes after each node's model response. Results are tracked via the same AIConfigTracker used for that node's LLM metrics, ensuring evaluation data is correlated correctly.

Key changes:

- New Evaluator class coordinating multiple judges; evaluate() returns an asyncio Task so evaluation fires immediately and is awaited in flush()
- AIAgentConfig and AICompletionConfig carry an eager evaluator (kw_only field)
- LangGraph runner stores per-node eval tasks in _pending_eval_tasks and flushes them via the callback handler's async flush() method
- OpenAI runner fires judge evaluation at handoff and final-segment points
- client._build_evaluator() handles empty/None judge config via Evaluator.noop()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
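For orientation, here is a minimal sketch of the eager-evaluation pattern this commit describes. Everything below is illustrative rather than the SDK's actual code: the `Judge` stub, the `JudgeResult` fields, and the `demo` driver are assumptions; only the idea that `evaluate()` returns an already-running `asyncio.Task` comes from the PR.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeResult:          # assumed shape; the real class lives in the SDK
    judge_name: str
    score: float


class Judge:                # stand-in for an LLM-backed judge
    def __init__(self, name: str) -> None:
        self.name = name

    async def judge(self, output: str) -> JudgeResult:
        await asyncio.sleep(0)               # real judges call a model here
        return JudgeResult(self.name, 1.0 if output else 0.0)


class Evaluator:
    """Coordinates multiple judges over one model response."""

    def __init__(self, judges: List[Judge]) -> None:
        self._judges = judges

    def evaluate(self, output: str) -> asyncio.Task[List[JudgeResult]]:
        # Returning a *Task* rather than a coroutine means evaluation is
        # already running; flush() only has to await it later.
        return asyncio.ensure_future(self._run_all(output))

    async def _run_all(self, output: str) -> List[JudgeResult]:
        return list(await asyncio.gather(*(j.judge(output) for j in self._judges)))


async def demo() -> None:
    task = Evaluator([Judge("relevance")]).evaluate("model output")
    print(await task)        # fires immediately, awaited at flush time

asyncio.run(demo())
```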
keelerm84 approved these changes on Apr 24, 2026
- Remove quotes from asyncio.Task return type in Evaluator.evaluate()
- Update ModelResponse.evaluations type to asyncio.Task[List[JudgeResult]]
- Forward default_ai_provider to __evaluate_agent in create_agent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_pending_eval_tasks was keyed by node key, so repeated visits (e.g. cycles or tool loops) would silently overwrite earlier eval tasks. Changed to Dict[str, List[Task]] with setdefault/append so all invocations are tracked. flush() now iterates the full list per node.

Also wraps the long __evaluate_agent call in create_agent to satisfy E501.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
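A compact sketch of the keying fix described above, under assumed names (`record_eval_task` and `flush_node` are illustrative helpers, not SDK functions):

```python
import asyncio
from typing import Dict, List

# node key -> every eval task fired for that node, not just the last one
_pending_eval_tasks: Dict[str, List["asyncio.Task"]] = {}


def record_eval_task(node_key: str, task: "asyncio.Task") -> None:
    # setdefault/append keeps earlier tasks when a node runs again
    # (cycles, tool loops); plain assignment would silently drop them.
    _pending_eval_tasks.setdefault(node_key, []).append(task)


async def flush_node(node_key: str) -> list:
    # flush() iterates the full list per node.
    results: list = []
    for t in _pending_eval_tasks.pop(node_key, []):
        results.extend(await t)  # each task resolves to a list of judge results
    return results
```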
Replace asyncio.create_task fire-and-forget with proper task collection and awaiting in both OpenAI and LangGraph runners, ensuring judge results are tracked reliably.

Use ContextVar in LangGraph runner to isolate pending eval task state across concurrent run() calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
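The `ContextVar` isolation can be pictured like this (a sketch under assumed names; `run_graph` and its node wiring are hypothetical, not the runner's real structure):

```python
import asyncio
from contextvars import ContextVar
from typing import Dict, List, Optional

# Each asyncio Task runs in its own copy of the context, so concurrent
# run() calls on the same runner never see each other's pending tasks.
_pending: ContextVar[Optional[Dict[str, List["asyncio.Task"]]]] = ContextVar(
    "pending_eval_tasks", default=None
)


async def run_graph(name: str) -> None:
    _pending.set({})                      # fresh, run-scoped state
    store = _pending.get()
    assert store is not None
    # A node firing one eval task; real nodes would call the evaluator.
    store.setdefault("node_a", []).append(
        asyncio.ensure_future(asyncio.sleep(0, result=f"{name}: eval"))
    )
    for tasks in store.values():          # flush: await everything we fired
        print(await asyncio.gather(*tasks))


async def main() -> None:
    # Two concurrent runs; each sees only its own _pending dict.
    await asyncio.gather(run_graph("run-1"), run_graph("run-2"))

asyncio.run(main())
```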
…_ai_provider

- Remove if-tracker guards in both runners since create_tracker is always set on enabled graphs (disabled graphs are filtered before runner creation), also fixing a token_usage NameError when tracker=None
- Forward variables through _build_evaluator to _initialize_judges so judge templates can interpolate user-provided variables
- Add default_ai_provider param to agent_graph() and forward it to __evaluate_agent so graph node evaluators use the correct provider; propagate from create_agent_graph() as well

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Redesign ManagedModel._track_judge_results to call evaluator.evaluate() internally and attach tracking via add_done_callback, returning the task so the reference is held by ModelResponse.evaluations (no GC risk)
- Warn instead of silently dropping eval tasks when the LangGraph ContextVar is unexpectedly unset in a node's execution context
- Make AgentGraphDefinition.create_tracker a required parameter; all production and test call sites already supply it, and this matches the invariant that runners only execute on enabled (always-tracked) graphs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
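The done-callback redesign roughly follows this shape (illustrative only; the real `_track_judge_results` signature is not shown in the PR):

```python
import asyncio


def _track_judge_results(evaluator, tracker, output: str) -> asyncio.Task:
    # Start evaluation here rather than at the call site.
    task = evaluator.evaluate(output)

    def _on_done(t: asyncio.Task) -> None:
        if t.cancelled() or t.exception() is not None:
            return                          # nothing to track on failure
        for result in t.result():
            tracker.track_judge_result(result)

    task.add_done_callback(_on_done)
    # Returning the task lets ModelResponse.evaluations hold a strong
    # reference, avoiding the classic pitfall where a Task referenced only
    # by its done-callback can be garbage-collected mid-flight.
    return task
```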
Both branches independently added evaluator/judge logic (this branch) and root-level tools map support (main). Conflicts in _completion_config and __evaluate_agent were resolved by keeping both changes. The parameter order swap for track_metrics_of_async auto-resolved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iew items

- Fix AgentGraphResult.evaluations type from Optional[List[Any]] to Optional[List[JudgeResult]]
- Populate evaluations in both LangGraph and OpenAI runners with all judge results
- Remove stray `if tracker:` guard in OpenAI _handle_handoff (tracker is always set)
- Add comment documenting why output_text is empty at handoff time in the OpenAI runner
- flush() now returns List[JudgeResult] instead of None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
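The type fix in the first bullet can be sketched like this (an illustrative subset; the real `AgentGraphResult` and `JudgeResult` fields are assumptions beyond what the commit names):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JudgeResult:                 # assumed shape
    judge_name: str
    score: float


@dataclass
class AgentGraphResult:            # illustrative subset of the real class
    output: str
    # Typed as Optional[List[JudgeResult]] rather than Optional[List[Any]],
    # so downstream consumers get real attribute access and type checking.
    evaluations: Optional[List[JudgeResult]] = None
```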
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 2a15009.
- Add `from __future__ import annotations` to evaluator.py so the self-referential `-> Evaluator` return type does not need quoting
- Log a warning when a judge fails to initialize in _initialize_judges instead of silently swallowing the exception

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
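Both bullets are easy to picture in a few lines (a sketch; the `_initialize_judges` internals and the `build_judge` stub are assumptions, not the SDK's code):

```python
from __future__ import annotations   # annotations become strings, so the
                                     # self-referential type needs no quotes
import logging

logger = logging.getLogger(__name__)


class Evaluator:
    @classmethod
    def noop(cls) -> Evaluator:      # previously had to be written -> "Evaluator"
        return cls()


def build_judge(cfg: dict):
    # Hypothetical stand-in for real judge construction.
    if "name" not in cfg:
        raise ValueError("judge config missing 'name'")
    return cfg["name"]


def _initialize_judges(configs: list) -> list:
    judges = []
    for cfg in configs:
        try:
            judges.append(build_judge(cfg))
        except Exception:
            # Warn rather than silently swallow a failed judge init.
            logger.warning("Failed to initialize judge %r", cfg, exc_info=True)
    return judges
```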
The OpenAI Agents SDK does not expose a node's text output at handoff time, making it impossible to evaluate intermediate nodes against real output. Rather than evaluating against an empty string, remove evaluation support from the OpenAI runner entirely until the SDK provides a suitable API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Summary
- New `Evaluator` class that coordinates per-node judge evaluation; `evaluate()` returns an `asyncio.Task` so evaluation fires immediately and is awaited before the graph run returns
- `AIAgentConfig` (and `AICompletionConfig`) now carry a pre-built `Evaluator` as a `kw_only` dataclass field, constructed eagerly in `client._build_evaluator()`
- The LangGraph `AgentGraphRunner` stores per-node eval tasks in `_pending_eval_tasks` during node execution; `LangChainCallbackHandler.flush()` (now async) awaits them and calls `track_judge_result` via the same `AIConfigTracker` used for that node's LLM metrics
- The OpenAI `AgentGraphRunner` fires judge evaluation at handoff and final-segment points, tracked via the node's config tracker
- `Evaluator.noop()` provides a null-object default so nodes without a `judgeConfiguration` require no special handling (a minimal sketch follows this list)
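A minimal sketch of the null-object default (the judge names and `demo` wiring are assumptions; only `Evaluator.noop()` and its purpose come from the summary):

```python
from __future__ import annotations

import asyncio
from typing import List, Optional


class Evaluator:
    def __init__(self, judges: Optional[List[str]] = None) -> None:
        self._judges = judges or []

    @classmethod
    def noop(cls) -> Evaluator:
        # Null object: same interface, zero judges.
        return cls([])

    def evaluate(self, output: str) -> asyncio.Task:
        async def _run() -> List[str]:
            return [f"{j} scored {output!r}" for j in self._judges]
        return asyncio.ensure_future(_run())


async def demo() -> None:
    # Node code never branches on whether a judgeConfiguration exists:
    # a noop evaluator resolves immediately to an empty list.
    for ev in (Evaluator(["helpfulness"]), Evaluator.noop()):
        print(await ev.evaluate("hi"))

asyncio.run(demo())
```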
Test plan

- `make test` (248 tests across 3 packages)
- `make lint`
- Run `langgraph-multi-agent-example` or `chat-judge-example` via `hello-python-ai` pointing at this worktree and verify judge events appear in the LD events stream

Closes AIC-2267
🤖 Generated with Claude Code
Note
Medium Risk
Introduces asynchronous judge-evaluation execution and wires results into both `ManagedModel` and the agent-graph runners, changing result types and tracker-flushing behavior. Risk is moderate due to new concurrency/task handling and API surface changes around `create_tracker` and the `evaluations` fields.

Overview

Adds a new `Evaluator` abstraction and threads it through `AICompletionConfig`/`AIAgentConfig` so judge evaluations can be kicked off per invocation and tracked automatically.

Updates `ManagedModel.invoke()` to start evaluation via the config's `evaluator`, attach a completion callback to emit `track_judge_result`, and changes `ModelResponse.evaluations` to carry an `asyncio.Task` (while `AgentGraphResult` now includes collected judge results).

Extends LangGraph execution to schedule per-node evaluation tasks during node invocation, store them per-run using a `ContextVar`, and make `LDMetricsCallbackHandler.flush()` async so it can await tasks, track successful judge results per node, and return all results.

Refactors judge initialization in `LDAIClient` to build evaluators eagerly (including new `default_ai_provider` plumbing), removes async judge setup from `create_model()`, and tightens `AgentGraphDefinition.create_tracker` to be required; OpenAI agent-graph tracking is aligned to always use a graph tracker and now returns token usage in `LDAIMetrics`.

Reviewed by Cursor Bugbot for commit e2f5b93.